Abstract



Using the $\ell_1$-norm to regularize the estimation of the parameter vector of a linear model
leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce
a new penalty function which takes into account the correlation of the design matrix to stabilize
the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates,
which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the
properties of our norm, describe an optimization algorithm based on reweighted least-squares, and
illustrate the behavior of this norm on synthetic data, showing that it is better adapted to
strong correlations than competing methods such as the elastic net.

1 Introduction 

The concept of parsimony is central in many scientific domains. In the context of statistics, signal 
processing or machine learning, it takes the form of variable or feature selection problems, and is 
commonly used in two situations: first, to make the model or the prediction more interpretable or 
cheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the 
best sparse approximation. Second, sparsity can also be used given prior knowledge that the model 
should be sparse. Many methods have been designed to learn sparse models, namely methods based
on combinatorial optimization [1, 2], Bayesian inference [3] or convex optimization [4, 5].

In this paper, we focus on regularization by sparsity-inducing norms. The simplest example
of such norms is the $\ell_1$-norm, leading to the Lasso when used within a least-squares framework.
In recent years, a large body of work has shown that the Lasso performs optimally in
high-dimensional low-correlation settings, in terms of prediction [6], estimation of parameters, or
estimation of supports [7, 8]. However, most data exhibit strong correlations, with various
correlation structures, such as clusters (i.e., close to block-diagonal covariance matrices) or
sparse graphs, as in problems involving sequences (in which case the covariance matrix is close to
a Toeplitz matrix [3]). In these situations, the Lasso is known to have stability problems: although
its predictive performance is not disastrous, the selected predictor may vary a lot (typically, given
two correlated variables, the Lasso will only select one of the two, at random).

Several remedies have been proposed to this instability. First, the elastic net [10] adds a
strongly convex penalty term (the squared $\ell_2$-norm) that stabilizes selection (typically, given
two correlated variables, the elastic net will select both). However, it is blind to the
exact correlation structure, and while strong convexity is required for some variables, it is not for
others. Another solution is to consider the group Lasso, which divides the predictors
into groups and penalizes the sum of the $\ell_2$-norms of these groups. This is known to accommodate
strong correlations within groups [12]; however, it requires knowing the groups in advance, which is
not always possible. A third line of research has focused on sampling-based techniques [13, 14, 15].



An ideal regularizer should thus be adapted to the design (like the group Lasso), but without
requiring human intervention (like the elastic net); it should add strong convexity only where
needed, and leave variables unmodified where things behave correctly. In this paper, we propose a
new norm towards this end.

More precisely we make the following contributions: 

• We propose in Section 2 a new norm based on the trace norm (a.k.a. nuclear norm) that
interpolates between the $\ell_1$-norm and the $\ell_2$-norm depending on correlations.

• We show in Section 2.2 that there is a unique minimum when penalizing with this norm.

• We provide optimization algorithms based on reweighted least-squares in Section 3.

• We study the second-order expansion around independence and relate it to existing work on
including correlations in Section 4.

• We perform synthetic experiments in Section 5, where we show that the trace Lasso
outperforms existing norms in strong-correlation regimes.

Notations. Let $M \in \mathbb{R}^{n \times p}$. The columns of $M$ are denoted using superscripts,
i.e., $M^{(i)}$ denotes the $i$-th column, while the rows are denoted using subscripts, i.e., $M_i$
denotes the $i$-th row. For $M \in \mathbb{R}^{p \times p}$, $\operatorname{diag}(M) \in \mathbb{R}^p$
is the diagonal of the matrix $M$, while for $u \in \mathbb{R}^p$,
$\operatorname{Diag}(u) \in \mathbb{R}^{p \times p}$ is the diagonal matrix whose diagonal elements
are the $u_i$. Let $S$ be a subset of $\{1, \dots, p\}$; then $u_S$ is the vector $u$ restricted to
the support $S$, with $0$ outside the support $S$. We denote by $\mathcal{S}_p$ the set of symmetric
matrices of size $p$. We will use various matrix norms, with the following notations:

• $\|M\|_*$ is the trace norm, i.e., the sum of the singular values of the matrix $M$,

• $\|M\|_{op}$ is the operator norm, i.e., the maximum singular value of the matrix $M$,

• $\|M\|_F$ is the Frobenius norm, i.e., the $\ell_2$-norm of the singular values, which is also
equal to $\sqrt{\operatorname{tr}(M^\top M)}$,

• $\|M\|_{2,1}$ is the sum of the $\ell_2$-norms of the columns of $M$:
$\|M\|_{2,1} = \sum_{i=1}^p \|M^{(i)}\|_2$.

2 Definition and properties of the trace Lasso 

We consider the problem of predicting $y \in \mathbb{R}$, given a vector $x \in \mathbb{R}^p$,
assuming a linear model

$$y = w^\top x + \varepsilon,$$

where $\varepsilon$ is (Gaussian) noise with mean $0$ and variance $\sigma^2$. Given a training set
$X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times p}$ and $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$,
a widely used method to estimate the parameter vector $w$ is penalized empirical risk minimization

$$\hat{w} \in \operatorname*{argmin}_{w} \frac{1}{n} \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda f(w), \qquad (1)$$

where $\ell$ is a loss function used to measure the error we make by predicting $w^\top x_i$ instead
of $y_i$, while $f$ is a regularization term used to penalize complex models. This second term helps
to avoid overfitting, especially when we have many more parameters than observations,
i.e., $n \ll p$.

2.1 Related work



We will now present some classical penalty functions for linear models which are widely used in the
machine learning and statistics community. The first one, known as Tikhonov regularization [16]
or ridge regression [17], is the squared $\ell_2$-norm. When used with the square loss, estimating the
parameter vector $w$ is done by solving a linear system. One of the main drawbacks of this penalty
function is the fact that it does not perform variable selection and thus does not behave well in
sparse high-dimensional settings.

Hence, it is natural to penalize linear models by the number of variables used by the model.
Unfortunately, this criterion, sometimes denoted by $\|\cdot\|_0$ ($\ell_0$-penalty), is not convex
and solving the problem in Eq. (1) is generally NP-hard [18]. Thus, a convex relaxation for this
problem was introduced, replacing the size of the selected subset by the $\ell_1$-norm of $w$. This
estimator is known as the Lasso [3] in the statistics community and basis pursuit [5] in signal
processing. It was later shown that under some assumptions, the two problems were in fact
equivalent (see for example [19] and references therein).

When two predictors are highly correlated, the Lasso has a very unstable behavior: it may
only select the variable that is the most correlated with the residual. On the other hand,
Tikhonov regularization tends to shrink coefficients of correlated variables together, leading to a
very stable behavior. In order to get the best of both worlds, stability and variable selection, Zou
and Hastie introduced the elastic net [10], which is the sum of the $\ell_1$-norm and squared
$\ell_2$-norm. Unfortunately, this estimator needs two regularization parameters and is not adaptive
to the precise correlation structure of the data. Some authors also proposed to use pairwise
correlations between predictors to interpolate more adaptively between the $\ell_1$-norm and squared
$\ell_2$-norm, by introducing the pairwise elastic net [20] (see comparisons with our approach in
Section 5).

Finally, when one has more knowledge about the data, for example clusters of variables that
should be selected together, one can use the group Lasso [11]. Given a partition $(S_i)$ of the set
of variables, it is defined as the sum of the $\ell_2$-norms of the restricted vectors $w_{S_i}$:

$$\|w\|_{GL} = \sum_{i=1}^k \|w_{S_i}\|_2.$$

The effect of this penalty function is to introduce sparsity at the group level: variables in a group
are selected altogether. One of the main drawbacks of this method, which is also sometimes one of
its strengths, is the fact that one needs to know the partition of the variables, and so one needs to
have a good knowledge of the data.

2.2 The ridge, the Lasso and the trace Lasso 

In this section, we show that Tikhonov regularization and the Lasso penalty can be viewed as 
norms of the matrix XDiag(w). We then introduce a new norm involving this matrix. 

The solution of empirical risk minimization penalized by the $\ell_1$-norm or $\ell_2$-norm is not
equivariant to rescaling of the predictors $X^{(i)}$, so it is common to normalize the predictors.
When normalizing the predictors $X^{(i)}$ and penalizing by Tikhonov regularization or by the Lasso,
people are implicitly using a regularization term that depends on the data or design matrix $X$. In
fact, there is an equivalence between normalizing the predictors and not normalizing them, using the
two following reweighted $\ell_2$- and $\ell_1$-norms instead of Tikhonov regularization and the Lasso:

$$\|w\|_2^2 = \sum_{i=1}^p \|X^{(i)}\|_2^2 \, w_i^2 \quad\text{and}\quad \|w\|_1 = \sum_{i=1}^p \|X^{(i)}\|_2 \, |w_i|. \qquad (2)$$

These two norms can be expressed using the matrix $X \operatorname{Diag}(w)$:

$$\|w\|_2 = \|X \operatorname{Diag}(w)\|_F \quad\text{and}\quad \|w\|_1 = \|X \operatorname{Diag}(w)\|_{2,1},$$

and a natural question arises: are there other relevant choices of functions or matrix norms? A
classical measure of the complexity of a model is the number of predictors used by this model,
which is equal to the size of the support of $w$. This penalty being non-convex, people use its
convex relaxation, which is the $\ell_1$-norm, leading to the Lasso.

Here, we propose a different measure of complexity which can be shown to be more adapted in
model selection settings [21]: the dimension of the subspace spanned by the selected predictors.
This is equal to the rank of the selected predictors, or also to the rank of the matrix
$X \operatorname{Diag}(w)$. As for the size of the support, this function is non-convex, and we
propose to replace it by a convex surrogate, the trace norm, leading to the following penalty that
we call "trace Lasso":

$$\Omega(w) = \|X \operatorname{Diag}(w)\|_*.$$

The trace Lasso has some interesting properties: if all the predictors are orthogonal, then it is
equal to the $\ell_1$-norm. Indeed, we have the decomposition:

$$X \operatorname{Diag}(w) = \sum_{i=1}^p \left( \|X^{(i)}\|_2 \, w_i \right) \frac{X^{(i)}}{\|X^{(i)}\|_2} e_i^\top,$$

where the $e_i$ are the vectors of the canonical basis. Since the predictors are orthogonal and the
$e_i$ are orthogonal too, this gives the singular value decomposition of $X \operatorname{Diag}(w)$
and we get

$$\|X \operatorname{Diag}(w)\|_* = \sum_{i=1}^p \|X^{(i)}\|_2 \, |w_i| = \|X \operatorname{Diag}(w)\|_{2,1}.$$

On the other hand, if all the predictors are equal to $X^{(1)}$, then

$$X \operatorname{Diag}(w) = X^{(1)} w^\top,$$

and we get $\|X \operatorname{Diag}(w)\|_* = \|X^{(1)}\|_2 \|w\|_2 = \|X \operatorname{Diag}(w)\|_F$,
which is equivalent to Tikhonov regularization. Thus when two predictors are strongly correlated,
our norm will behave like Tikhonov regularization, while for almost uncorrelated predictors, it will
behave like the Lasso.
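
The two limiting cases above can be checked numerically. The following is a minimal sketch using NumPy; the helper name `trace_lasso`, the random design and the problem sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso penalty: nuclear norm of X @ Diag(w)."""
    # X * w scales column i of X by w[i], which equals X @ np.diag(w).
    return np.linalg.norm(X * w, ord="nuc")

rng = np.random.default_rng(0)
n, p = 20, 5
w = rng.standard_normal(p)

# Orthonormal predictors (columns of Q): the penalty reduces to the l1-norm.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
l1_case = trace_lasso(Q, w)

# Identical unit-norm predictors: the penalty reduces to the l2-norm.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
X_same = np.tile(x[:, None], (1, p))
l2_case = trace_lasso(X_same, w)
```

Here `l1_case` should match `np.abs(w).sum()` and `l2_case` should match `np.linalg.norm(w)` up to floating-point error.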

Always having a unique minimum is an important property for a statistical estimator, as it is 
a first step towards stability. The trace Lasso, by adding strong convexity exactly in the direction 
of highly correlated covariates, always has a unique minimum, and is much more stable than the 
Lasso. 

Proposition 1. If the loss function $\ell$ is strongly convex with respect to its second argument,
then the solution of the empirical risk minimization penalized by the trace Lasso, i.e., Eq. (1), is
unique.

The technical proof of this proposition is given in Appendix B and consists of showing that in
the flat directions of the loss function, the trace Lasso is strongly convex.

2.3 A new family of penalty functions 

In this section, we introduce a new family of penalties, inspired by the trace Lasso, allowing us
to write the $\ell_1$-norm, the $\ell_2$-norm and the newly introduced trace Lasso as special cases.
In fact, we note that $\|\operatorname{Diag}(w)\|_* = \|w\|_1$ and
$\|p^{-1/2} 1^\top \operatorname{Diag}(w)\|_* = \|w^\top\|_* = \|w\|_2$. In other words,
we can express the $\ell_1$- and $\ell_2$-norms of $w$ using the trace norm of a given matrix times
the matrix $\operatorname{Diag}(w)$. A natural question to ask is: what happens when using a matrix
$P$ other than the identity or the line vector $p^{-1/2} 1^\top$, and what are good choices of such
matrices? Therefore, we introduce the following family of penalty functions:

Definition 1. Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We introduce
the norm $\Omega_P$ as

$$\Omega_P(w) = \|P \operatorname{Diag}(w)\|_*.$$

Proof. The positive homogeneity and triangle inequality are direct consequences of the linearity
of $w \mapsto P \operatorname{Diag}(w)$ and the fact that $\|\cdot\|_*$ is a norm. Since none of the
columns of $P$ is equal to zero, we have

$$P \operatorname{Diag}(w) = 0 \iff w = 0,$$

and so $\Omega_P$ separates points and is a norm. □

Figure 1: Unit balls for various values of $P^\top P$. See the text for the values of $P^\top P$.
(Best seen in color.)

As stated before, the $\ell_1$- and $\ell_2$-norms are special cases of the family of norms we just
introduced. Another important penalty that can be expressed as a special case is the group Lasso
with non-overlapping groups. Given a partition $(S_j)$ of the set $\{1, \dots, p\}$, the group Lasso
is defined by

$$\|w\|_{GL} = \sum_{S_j} \|w_{S_j}\|_2.$$

We define the matrix $P^{GL}$ by

$$P^{GL}_{ij} = \begin{cases} 1/\sqrt{|S_k|} & \text{if $i$ and $j$ are in the same group $S_k$,} \\ 0 & \text{otherwise.} \end{cases}$$

Then,

$$P^{GL} \operatorname{Diag}(w) = \sum_{S_j} \frac{1_{S_j}}{\sqrt{|S_j|}} \, w_{S_j}^\top. \qquad (3)$$

Using the fact that $(S_j)$ is a partition of $\{1, \dots, p\}$, the vectors $1_{S_j}$ are orthogonal
and so are the vectors $w_{S_j}$. Hence, after normalizing the vectors, Eq. (3) gives a singular
value decomposition of $P^{GL} \operatorname{Diag}(w)$ and so the group Lasso penalty can be
expressed as a special case of our family of norms:

$$\|P^{GL} \operatorname{Diag}(w)\|_* = \sum_{S_j} \|w_{S_j}\|_2 = \|w\|_{GL}.$$
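
As an illustration of this special case, the sketch below builds $P^{GL}$ for a small hand-picked partition and compares the trace norm with the group Lasso penalty. The helper `group_lasso_matrix`, the partition and the vector `w` are assumptions chosen for the example.

```python
import numpy as np

def group_lasso_matrix(groups, p):
    """P^GL with entries 1/sqrt(|S_k|) when i and j share a group S_k, else 0."""
    P = np.zeros((p, p))
    for g in groups:
        idx = np.asarray(g)
        P[np.ix_(idx, idx)] = 1.0 / np.sqrt(len(idx))
    return P

groups = [[0, 1, 2], [3, 4], [5]]          # a partition of {0, ..., 5}
w = np.array([1.0, -2.0, 0.5, 3.0, 0.0, -1.5])
P = group_lasso_matrix(groups, len(w))

# Trace norm of P^GL Diag(w) versus the sum of group-wise l2-norms.
omega = np.linalg.norm(P * w, ord="nuc")
group_lasso = sum(np.linalg.norm(w[g]) for g in groups)
```

The columns of `P` have unit norm by construction, as Definition 1 requires, and `omega` should coincide with `group_lasso`.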

In the following proposition, we show that our norm only depends on the value of $P^\top P$. This
is an important property for the trace Lasso, where $P = X$, since it underlies the fact that this
penalty only depends on the correlation matrix $X^\top X$ of the covariates.

Proposition 2. Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We have

$$\Omega_P(w) = \|(P^\top P)^{1/2} \operatorname{Diag}(w)\|_*.$$
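
Proposition 2 can be sanity-checked numerically. The sketch below is only an illustration under assumed sizes and a random $P$; the symmetric square root is computed from an eigendecomposition of the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 8, 4
P = rng.standard_normal((k, p))
P /= np.linalg.norm(P, axis=0)            # enforce unit-norm columns
w = rng.standard_normal(p)

# Symmetric square root of the Gram matrix P^T P via its eigendecomposition.
evals, evecs = np.linalg.eigh(P.T @ P)
gram_sqrt = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

omega_P = np.linalg.norm(P * w, ord="nuc")
omega_gram = np.linalg.norm(gram_sqrt * w, ord="nuc")
```

Both quantities should agree up to round-off, even though `P` is 8 x 4 and `gram_sqrt` is 4 x 4.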

We plot the unit ball of our norm for the following values of $P^\top P$ (see Figure 1):

This shows that, as for the elastic net, our norms interpolate between the $\ell_1$-norm and the
$\ell_2$-norm. But the main difference between the elastic net and our norms is the fact that our
norms are adaptive, and require a single regularization parameter to tune. In particular for the
trace Lasso, when two covariates are strongly correlated, it will be close to the $\ell_2$-norm,
while when two covariates are almost uncorrelated, it will behave like the $\ell_1$-norm. This is a
behavior close to that of the pairwise elastic net [20].

Proposition 3. Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We have

$$\|w\|_2 \leq \Omega_P(w) \leq \|w\|_1.$$



2.4 Dual norm 

The dual norm is an important quantity for both optimization and theoretical analysis of the 
estimator. Unfortunately, we are not able in general to obtain a closed form expression of the dual 
norm for the family of norms we just introduced. However we can obtain a bound, which is exact 
for some special cases: 

Proposition 4. The dual norm, defined by $\Omega_P^*(u) = \max_{\Omega_P(v) \leq 1} u^\top v$, can be
bounded by:

$$\Omega_P^*(u) \leq \|P \operatorname{Diag}(u)\|_{op}.$$

Proof. Using the fact that $\operatorname{diag}(P^\top P) = 1$, we have

$$u^\top v = \operatorname{tr}\left(\operatorname{Diag}(u) P^\top P \operatorname{Diag}(v)\right) \leq \|P \operatorname{Diag}(u)\|_{op} \, \|P \operatorname{Diag}(v)\|_*,$$

where the inequality comes from the fact that the operator norm $\|\cdot\|_{op}$ is the dual norm of
the trace norm. The definition of the dual norm then gives the result. □

As a corollary, we can bound the dual norm by a constant times the $\ell_\infty$-norm:

$$\Omega_P^*(u) \leq \|P \operatorname{Diag}(u)\|_{op} \leq \|P\|_{op} \|\operatorname{Diag}(u)\|_{op} = \|P\|_{op} \|u\|_\infty.$$

Using Proposition 3, we also have the inequality $\Omega_P^*(u) \geq \|u\|_\infty$.



3 Optimization algorithm 

In this section, we introduce an algorithm to estimate the parameter vector $w$ when the loss
function is equal to the square loss, $\ell(y, w^\top x) = \frac{1}{2}(y - w^\top x)^2$, and the
penalty is the trace Lasso. It is straightforward to extend this algorithm to the family of norms
indexed by $P$. The problem we consider is

$$\min_w \frac{1}{2} \|y - Xw\|_2^2 + \lambda \|X \operatorname{Diag}(w)\|_*.$$

We could optimize this cost function by subgradient descent, but this is quite inefficient: computing 
the subgradient of the trace Lasso is expensive and the rate of convergence of subgradient descent 
is quite slow. Instead, we consider an iteratively reweighted least-squares method. First, we need
to introduce a well-known variational formulation for the trace norm [22]:

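Proposition 5. Let $M \in \mathbb{R}^{n \times p}$. The trace norm of $M$ is equal to:

$$\|M\|_* = \frac{1}{2} \inf_{S \succeq 0} \operatorname{tr}\left(M^\top S^{-1} M\right) + \operatorname{tr}(S),$$

and the infimum is attained for $S = \left(M M^\top\right)^{1/2}$.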
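
The variational formula of Proposition 5 can be illustrated numerically. The sketch below assumes a random $M$ with full row rank, so that $(M M^\top)^{1/2}$ is invertible; the sizes and the helper `objective` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 6
M = rng.standard_normal((n, p))          # full row rank, so M M^T is invertible

def objective(S):
    """(1/2) * [tr(M^T S^{-1} M) + tr(S)] for a symmetric positive definite S."""
    return 0.5 * (np.trace(M.T @ np.linalg.solve(S, M)) + np.trace(S))

# Candidate minimizer S = (M M^T)^{1/2}, via the eigendecomposition of M M^T.
evals, evecs = np.linalg.eigh(M @ M.T)
S_opt = (evecs * np.sqrt(evals)) @ evecs.T

trace_norm = np.linalg.norm(M, ord="nuc")
value_at_opt = objective(S_opt)
value_elsewhere = objective(S_opt + 0.1 * np.eye(n))
```

The objective evaluated at `S_opt` should equal the trace norm, while any other positive definite choice, such as the shifted matrix above, gives a larger value.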



Using this proposition, we can reformulate the previous optimization problem as

$$\min_w \inf_{S \succeq 0} \frac{1}{2} \|y - Xw\|_2^2 + \frac{\lambda}{2} w^\top \operatorname{Diag}\left(\operatorname{diag}(X^\top S^{-1} X)\right) w + \frac{\lambda}{2} \operatorname{tr}(S).$$

This problem is jointly convex in $(w, S)$ [23]. In order to optimize this objective function by
alternating the minimization over $w$ and $S$, we need to add a term
$\frac{\lambda \mu_i}{2} \operatorname{tr}(S^{-1})$. Otherwise, the infimum over $S$ could be attained
at a non-invertible $S$, leading to a non-convergent algorithm. The infimum over $S$ is then attained
for $S = \left(X \operatorname{Diag}(w)^2 X^\top + \mu_i I\right)^{1/2}$.

Optimizing over $w$ is a least-squares problem penalized by a reweighted $\ell_2$-norm equal to
$w^\top D w$, where $D = \operatorname{Diag}\left(\operatorname{diag}(X^\top S^{-1} X)\right)$. It is
equivalent to solving the linear system

$$(X^\top X + \lambda D) w = X^\top y.$$

This can be done efficiently by using a conjugate gradient method. Since the cost of multiplying
$(X^\top X + \lambda D)$ by a vector is $O(np)$, solving the system has a complexity of $O(knp)$,
where $k \leq p$ is the number of iterations needed to converge. Using warm restarts, $k$ can be much
smaller than $p$, since the linear system we are solving does not change much from one iteration to
the next. Below we summarize the algorithm:



Iterative algorithm for estimating $w$

Input: the design matrix $X$, the initial guess $w^0$, the number of iterations $N$, the sequence $\mu_i$.
For $i = 1 \dots N$:

• Compute the eigenvalue decomposition $U \operatorname{Diag}(s_k) U^\top$ of $X \operatorname{Diag}(w^{i-1})^2 X^\top$.

• Set $D = \operatorname{Diag}(\operatorname{diag}(X^\top S^{-1} X))$, where $S^{-1} = U \operatorname{Diag}(1/\sqrt{s_k + \mu_i}) U^\top$.

• Set $w^i$ by solving the system $(X^\top X + \lambda D) w = X^\top y$.

For the sequence $\mu_i$, we use a decreasing sequence converging to ten times the machine precision.
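
The steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `trace_lasso_irls`, the all-ones initial guess, the geometric decay of $\mu_i$ (`mu_start`, `mu_decay`) and the dense solve used in place of conjugate gradient are all assumptions.

```python
import numpy as np

def trace_lasso_irls(X, y, lam, n_iter=100, mu_start=1.0, mu_decay=0.7):
    """Reweighted least-squares sketch for 0.5*||y - Xw||^2 + lam*||X Diag(w)||_*."""
    n, p = X.shape
    mu_floor = 10 * np.finfo(float).eps              # ten times machine precision
    gram, Xty = X.T @ X, X.T @ y
    w = np.ones(p)                                   # assumed initial guess w^0
    for i in range(n_iter):
        mu = max(mu_start * mu_decay**i, mu_floor)   # assumed geometric decay
        # Eigendecomposition U Diag(s) U^T of X Diag(w)^2 X^T.
        XD = X * w
        s, U = np.linalg.eigh(XD @ XD.T)
        s = np.clip(s, 0.0, None)                    # guard against round-off
        # S^{-1} = U Diag(1/sqrt(s + mu)) U^T and D = Diag(diag(X^T S^{-1} X)).
        UtX = U.T @ X
        d = np.einsum("kj,k,kj->j", UtX, 1.0 / np.sqrt(s + mu), UtX)
        # Solve (X^T X + lam * D) w = X^T y with a dense solver.
        w = np.linalg.solve(gram + lam * np.diag(d), Xty)
    return w
```

When the columns of `X` are orthonormal, the trace Lasso coincides with the $\ell_1$-norm, so the output should approach the soft-thresholding of `X.T @ y` at level `lam`; this gives a simple correctness check. For large $p$, the dense solve would be replaced by a warm-started conjugate gradient as described in the text.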

3.1 Choice of $\lambda$

We now give a method to choose the regularization path. In fact, we know that the vector $0$ is a
solution if and only if $\lambda \geq \Omega^*(X^\top y)$ [24]. Thus, we need to start the path at
$\lambda = \Omega^*(X^\top y)$, corresponding to the empty solution $0$, and then decrease $\lambda$.
Using the inequalities on the dual norm we obtained in the previous section, we get

$$\|X^\top y\|_\infty \leq \Omega^*(X^\top y) \leq \|X\|_{op} \|X^\top y\|_\infty.$$

Therefore, starting the path at $\lambda = \|X\|_{op} \|X^\top y\|_\infty$ is a good choice.
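
A regularization path starting at this upper bound could be generated as in the sketch below. The geometric spacing, the number of values and the final ratio are assumptions made for illustration; the text only specifies the starting point.

```python
import numpy as np

def lambda_path(X, y, n_lambdas=20, ratio=1e-3):
    """Decreasing grid of lambdas starting at ||X||_op * ||X^T y||_inf."""
    # ord=2 on a 2-D array gives the largest singular value (operator norm).
    lam_max = np.linalg.norm(X, ord=2) * np.max(np.abs(X.T @ y))
    # Assumed geometric spacing from lam_max down to ratio * lam_max.
    return lam_max * np.geomspace(1.0, ratio, n_lambdas)
```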

4 Approximation around the Lasso 

In this section, we compute the second-order approximation of our norm around the special case
corresponding to the Lasso. We recall that when $P = I \in \mathbb{R}^{p \times p}$, our norm is equal
to the $\ell_1$-norm. We add a small perturbation $\Delta \in \mathcal{S}_p$ to the identity matrix,
and using Proposition 6 of Appendix A we obtain the following second-order approximation:

$$\begin{aligned}
\|(I + \Delta) \operatorname{Diag}(w)\|_* = {} & \|w\|_1 + \operatorname{diag}(\Delta)^\top |w| \\
& + \sum_{|w_i| > 0} \sum_{|w_j| > 0} \frac{\left(\Delta_{ji} |w_i| - \Delta_{ij} |w_j|\right)^2}{4 (|w_i| + |w_j|)}
+ \sum_{|w_i| = 0} \sum_{|w_j| > 0} \frac{\left(\Delta_{ij} |w_j|\right)^2}{2 |w_j|} + o(\|\Delta\|^2).
\end{aligned}$$

We can rewrite this approximation as

$$\|(I + \Delta) \operatorname{Diag}(w)\|_* = \|w\|_1 + \operatorname{diag}(\Delta)^\top |w| + \sum_{i,j} \frac{\Delta_{ij}^2 \left(|w_i| - |w_j|\right)^2}{4 (|w_i| + |w_j|)} + o(\|\Delta\|^2),$$

using a slight abuse of notation, considering that the last term is equal to $0$ when $w_i = w_j = 0$.
The second-order term is quite interesting: it shows that when two covariates are correlated, the
effect of the trace Lasso is to shrink the corresponding coefficients toward each other. Another
interesting remark is the fact that this term is very similar to pairwise elastic net penalties, which
are of the form $|w|^\top P |w|$, where $P_{ij}$ is a decreasing function of $\Delta_{ij}$.
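
The expansion can be compared with the exact trace norm for a small symmetric perturbation. The sketch below is an illustration only; the vector `w`, the perturbation scale and the helper `second_order_approx` are assumptions.

```python
import numpy as np

def second_order_approx(w, Delta):
    """First- and second-order expansions of ||(I + Delta) Diag(w)||_* (Delta symmetric)."""
    a = np.abs(w)
    first = a.sum() + np.diag(Delta) @ a
    num = Delta**2 * (a[:, None] - a[None, :]) ** 2
    den = 4.0 * (a[:, None] + a[None, :])
    # Pairs with w_i = w_j = 0 contribute zero by convention.
    second = np.divide(num, den, out=np.zeros_like(num), where=den > 0).sum()
    return first, first + second

rng = np.random.default_rng(3)
w = np.array([1.0, -2.0, 3.0, 0.0, 5.0])
B = rng.standard_normal((5, 5))
Delta = 1e-4 * (B + B.T)                 # small symmetric perturbation

exact = np.linalg.norm((np.eye(5) + Delta) * w, ord="nuc")
first, second = second_order_approx(w, Delta)
err_first, err_second = abs(exact - first), abs(exact - second)
```

Including the second-order term should reduce the approximation error markedly compared with the first-order expansion alone.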



5 Experiments 

In this section, we perform synthetic experiments to illustrate the behavior of the trace Lasso
and other classical penalties when there are highly correlated covariates in the design matrix.
For all experiments, we have $p = 1024$ covariates and $n = 256$ observations. The support $S$ of
$w$ is equal to $\{1, \dots, k\}$, where $k$ is the size of the support. For $i$ in the support of
$w$, we have $w_i = 2 (b_i - 1/2)$, where each $b_i$ is independently drawn from a uniform
distribution on $[0, 1]$. The observations $x_i$ are drawn from a multivariate Gaussian with mean $0$
and covariance matrix $\Sigma$. For the first experiment, $\Sigma$ is set to the identity; for the
second experiment, $\Sigma$ is block diagonal with blocks equal to $0.2 I + 0.8 \cdot 1 1^\top$,
corresponding to clusters of eight variables; finally, for the third experiment, we set
$\Sigma_{ij} = 0.95^{|i - j|}$, corresponding to a Toeplitz design. For each method, we
choose the best $\lambda$ for the estimation error, which is reported.
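
The three designs could be generated along the following lines. This is a sketch under stated assumptions: the function names, the seed handling and the Cholesky-based sampling are illustrative, and the response $y$ is not generated because the noise level is not specified here.

```python
import numpy as np

def make_covariance(kind, p=1024, block=8):
    """Covariance for the three designs: 'identity', 'block', or 'toeplitz'."""
    if kind == "identity":
        return np.eye(p)
    if kind == "block":
        B = 0.2 * np.eye(block) + 0.8 * np.ones((block, block))
        return np.kron(np.eye(p // block), B)
    if kind == "toeplitz":
        idx = np.arange(p)
        return 0.95 ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown design: {kind}")

def make_data(kind, k, n=256, p=1024, seed=0):
    """Draw X with rows N(0, Sigma) and w supported on the first k coordinates."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(make_covariance(kind, p))
    X = rng.standard_normal((n, p)) @ L.T            # rows have covariance L L^T
    w = np.zeros(p)
    w[:k] = 2.0 * (rng.uniform(0.0, 1.0, size=k) - 0.5)
    return X, w
```

For example, `X, w = make_data("block", k=16)` would give one draw of the second design with a support of size 16.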

Overall all methods behave similarly in the noiseless and the noisy settings, hence we only 
report results for the noisy setting. In all three graphs of Figure 2, we observe behaviors that 
are typical of Lasso, ridge and elastic net: the Lasso performs very well on sparse models, but its 
performance is rather poor for denser models, almost as poor as the ridge regression. The elastic 
net offers the best of both worlds since its two parameters allow it to interpolate adaptively between 
the Lasso and the ridge. In experiment 1, since the variables are uncorrelated, there is no reason 
to couple their selection. This suggests that the Lasso should be the most appropriate convex 
regularization. The trace Lasso approaches the Lasso as n goes to infinity, but the weak coupling 
induced by empirical correlations is sufficient to slightly decrease its performance compared to that 
of the Lasso. By contrast, in experiments 2 and 3, the trace Lasso outperforms other methods 
(including the pairwise elastic net) since variables that should be selected together are indeed 
correlated. As for the pairwise elastic net, since it takes into account the correlations between
variables, it is not surprising that in experiments 2 and 3 it performs better than methods that do
not. We do not have a compelling explanation for its superior performance in experiment 1.




Figure 2: Experiment for uncorrelated variables (Best seen in color, en stands for elastic net, pen 
stands for pairwise elastic net and trace stands for trace Lasso.) 

6 Conclusion



We introduce a new penalty function, the trace Lasso, which takes advantage of the correlation
between covariates to add strong convexity exactly in the directions where needed, unlike the
elastic net for example, which blindly adds a squared $\ell_2$-norm term in every direction. We
show on synthetic data that this adaptive behavior leads to better estimation performance. In
the future, we want to show that if a dedicated norm using prior knowledge such as the group
Lasso can be used, the trace Lasso will behave similarly and its performance will not degrade
too much, providing theoretical guarantees for such adaptivity. Finally, we will seek applications
of this estimator in inverse problems such as deblurring, where the design matrix exhibits a strong
correlation structure.

A Perturbation of the trace norm

We follow the technique used in [25] to obtain an approximation of the trace norm.



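A.1 Jordan-Wielandt matrices

Let $M \in \mathbb{R}^{n \times p}$ be of rank $r$. We denote by $s_1 \geq s_2 \geq \dots \geq s_r > 0$
the strictly positive singular values of $M$, and by $u_i$, $v_i$ the associated left and right
singular vectors. We introduce the Jordan-Wielandt matrix

$$\tilde{M} = \begin{pmatrix} 0 & M \\ M^\top & 0 \end{pmatrix} \in \mathbb{R}^{(n+p) \times (n+p)}.$$

The singular values of $M$ and the eigenvalues of $\tilde{M}$ are related: $\tilde{M}$ has eigenvalues
$s_i$ and $s_{-i} = -s_i$, associated to the eigenvectors

$$w_i = \frac{1}{\sqrt{2}} \begin{pmatrix} u_i \\ v_i \end{pmatrix} \quad\text{and}\quad w_{-i} = \frac{1}{\sqrt{2}} \begin{pmatrix} u_i \\ -v_i \end{pmatrix}.$$

The remaining eigenvalues of $\tilde{M}$ are equal to $0$ and are associated to eigenvectors of the form

$$w = \frac{1}{\sqrt{2}} \begin{pmatrix} u \\ v \end{pmatrix} \quad\text{and}\quad w = \frac{1}{\sqrt{2}} \begin{pmatrix} u \\ -v \end{pmatrix},$$

where $\forall i \in \{1, \dots, r\}$, $u^\top u_i = v^\top v_i = 0$.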
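
The relation between the singular values of $M$ and the spectrum of its Jordan-Wielandt matrix is easy to verify numerically. The sketch below uses an assumed random $M$ of size 3 x 5 (so $r = 3$ almost surely).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 5
M = rng.standard_normal((n, p))

# Jordan-Wielandt matrix [[0, M], [M^T, 0]] of size (n + p) x (n + p).
M_tilde = np.block([[np.zeros((n, n)), M], [M.T, np.zeros((p, p))]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
eigvals = np.linalg.eigvalsh(M_tilde)    # returned in ascending order

# Expected spectrum: +s_i, -s_i, and n + p - 2r zeros (here r = n).
expected = np.sort(np.concatenate([s, -s, np.zeros(n + p - 2 * len(s))]))

# w_1 = (u_1; v_1) / sqrt(2) should be an eigenvector for the eigenvalue s_1.
w1 = np.concatenate([U[:, 0], Vt[0]]) / np.sqrt(2)
residual = np.linalg.norm(M_tilde @ w1 - s[0] * w1)
```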

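A.2 Cauchy residue formula

Let $C$ be a closed curve that does not go through the eigenvalues of $\tilde{M}$. We define

$$\Pi_C(\tilde{M}) = \frac{1}{2 i \pi} \oint_C \lambda (\lambda I - \tilde{M})^{-1} \, d\lambda.$$

We have

$$\begin{aligned}
\Pi_C(\tilde{M}) &= \frac{1}{2 i \pi} \oint_C \sum_j \frac{\lambda}{\lambda - s_j} \, w_j w_j^\top \, d\lambda \\
&= \frac{1}{2 i \pi} \oint_C \sum_j \left( 1 + \frac{s_j}{\lambda - s_j} \right) w_j w_j^\top \, d\lambda \\
&= \sum_{s_j \in \operatorname{int}(C)} s_j \, w_j w_j^\top.
\end{aligned}$$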

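A.3 Perturbation analysis

Let $\Delta \in \mathbb{R}^{n \times p}$ be a perturbation matrix such that $\|\Delta\|_{op} < s_r / 4$,
let $\tilde{\Delta}$ be its Jordan-Wielandt matrix, and let $C$ be a closed curve around the $r$
largest eigenvalues of $\tilde{M}$ and $\tilde{M} + \tilde{\Delta}$. We can study the perturbation of
the strictly positive singular values of $M$ by computing the trace of
$\Pi_C(\tilde{M} + \tilde{\Delta}) - \Pi_C(\tilde{M})$. Using the fact that
$(\lambda I - \tilde{M} - \tilde{\Delta})^{-1} = (\lambda I - \tilde{M})^{-1} + (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M} - \tilde{\Delta})^{-1}$,
we have

$$\begin{aligned}
\Pi_C(\tilde{M} + \tilde{\Delta}) - \Pi_C(\tilde{M}) = {} & \frac{1}{2 i \pi} \oint_C \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \, d\lambda \\
& + \frac{1}{2 i \pi} \oint_C \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \, d\lambda \\
& + \frac{1}{2 i \pi} \oint_C \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M} - \tilde{\Delta})^{-1} \, d\lambda.
\end{aligned}$$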
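
We denote by $A$ and $B$ the first two terms of the right-hand side of this equation. We have

$$\begin{aligned}
\operatorname{tr}(A) &= \sum_{j,k} \operatorname{tr}\left(w_j w_j^\top \tilde{\Delta} w_k w_k^\top\right) \frac{1}{2 i \pi} \oint_C \frac{\lambda \, d\lambda}{(\lambda - s_j)(\lambda - s_k)} \\
&= \sum_j \operatorname{tr}\left(w_j^\top \tilde{\Delta} w_j\right) \frac{1}{2 i \pi} \oint_C \frac{\lambda \, d\lambda}{(\lambda - s_j)^2} \\
&= \sum_{j=1}^r \operatorname{tr}\left(w_j^\top \tilde{\Delta} w_j\right) \\
&= \sum_{j=1}^r \operatorname{tr}\left(u_j^\top \Delta v_j\right),
\end{aligned}$$

and

$$\operatorname{tr}(B) = \sum_{j,k} \operatorname{tr}\left(w_j w_j^\top \tilde{\Delta} w_k w_k^\top \tilde{\Delta} w_j w_j^\top\right) \frac{1}{2 i \pi} \oint_C \frac{\lambda \, d\lambda}{(\lambda - s_j)^2 (\lambda - s_k)}.$$

If $s_j = s_k$, the integral is null. Otherwise, we have

$$\frac{\lambda}{(\lambda - s_j)^2 (\lambda - s_k)} = \frac{a}{\lambda - s_j} + \frac{b}{\lambda - s_k} + \frac{c}{(\lambda - s_j)^2},$$

where

$$a = \frac{-s_k}{(s_k - s_j)^2}, \qquad b = \frac{s_k}{(s_k - s_j)^2}, \qquad c = \frac{s_j}{s_j - s_k}.$$

Therefore, if $s_j$ and $s_k$ are both inside or both outside the interior of $C$, the integral is
equal to zero, and

$$\begin{aligned}
\operatorname{tr}(B) &= \sum_{s_j > 0} \sum_{s_k \leq 0} \frac{-s_k \left(w_j^\top \tilde{\Delta} w_k\right)^2}{(s_j - s_k)^2} + \sum_{s_j \leq 0} \sum_{s_k > 0} \frac{s_k \left(w_j^\top \tilde{\Delta} w_k\right)^2}{(s_j - s_k)^2} \\
&= \sum_{s_j > 0} \sum_{s_k > 0} \frac{s_k \left(w_j^\top \tilde{\Delta} w_{-k}\right)^2}{(s_j + s_k)^2} + \sum_{s_j > 0} \sum_{s_k > 0} \frac{s_k \left(w_{-j}^\top \tilde{\Delta} w_k\right)^2}{(s_j + s_k)^2} + \sum_{s_j = 0} \sum_{s_k > 0} \frac{\left(w_j^\top \tilde{\Delta} w_k\right)^2}{s_k} \\
&= \sum_{s_j > 0} \sum_{s_k > 0} \frac{\left(w_{-j}^\top \tilde{\Delta} w_k\right)^2}{s_j + s_k} + \sum_{s_j = 0} \sum_{s_k > 0} \frac{\left(w_j^\top \tilde{\Delta} w_k\right)^2}{s_k}.
\end{aligned}$$

For $s_j > 0$ and $s_k > 0$, we have

$$w_{-j}^\top \tilde{\Delta} w_k = \frac{1}{2} \left( u_j^\top \Delta v_k - u_k^\top \Delta v_j \right),$$

and for $s_j = 0$ and $s_k > 0$, we have

$$w_j^\top \tilde{\Delta} w_k = \frac{1}{2} \left( \pm u_k^\top \Delta v_j + u_j^\top \Delta v_k \right).$$

So

$$\operatorname{tr}(B) = \sum_{s_j > 0} \sum_{s_k > 0} \frac{\left(u_j^\top \Delta v_k - u_k^\top \Delta v_j\right)^2}{4 (s_j + s_k)} + \sum_{s_j = 0} \sum_{s_k > 0} \frac{\left(u_k^\top \Delta v_j\right)^2 + \left(u_j^\top \Delta v_k\right)^2}{2 s_k}.$$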
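
Now, let $C_0$ be the circle of center $0$ and radius $s_r / 2$. We can study the perturbation of the
singular values of $M$ equal to zero by computing the trace norm of
$\Pi_{C_0}(\tilde{M} + \tilde{\Delta}) - \Pi_{C_0}(\tilde{M})$. We have

$$\begin{aligned}
\Pi_{C_0}(\tilde{M} + \tilde{\Delta}) - \Pi_{C_0}(\tilde{M}) = {} & \frac{1}{2 i \pi} \oint_{C_0} \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \, d\lambda \\
& + \frac{1}{2 i \pi} \oint_{C_0} \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \, d\lambda \\
& + \frac{1}{2 i \pi} \oint_{C_0} \lambda (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M})^{-1} \tilde{\Delta} (\lambda I - \tilde{M} - \tilde{\Delta})^{-1} \, d\lambda.
\end{aligned}$$

Then, if we denote the first integral by $C$ and the second one by $D$, we get

$$C = \sum_{j,k} w_j w_j^\top \tilde{\Delta} w_k w_k^\top \, \frac{1}{2 i \pi} \oint_{C_0} \frac{\lambda \, d\lambda}{(\lambda - s_j)(\lambda - s_k)}.$$

If both $s_j$ and $s_k$ are outside $\operatorname{int}(C_0)$, then the integral is equal to zero. If
one of them is inside, say $s_j$, then $s_j = 0$ and the integral is equal to

$$\frac{1}{2 i \pi} \oint_{C_0} \frac{d\lambda}{\lambda - s_k}.$$

Then this integral is non-null if and only if $s_k$ is also inside $\operatorname{int}(C_0)$. Thus

$$\begin{aligned}
C &= \sum_{j,k} w_j w_j^\top \tilde{\Delta} w_k w_k^\top \, 1_{s_j \in \operatorname{int}(C_0)} 1_{s_k \in \operatorname{int}(C_0)} \\
&= \sum_{s_j = 0} \sum_{s_k = 0} w_j w_j^\top \tilde{\Delta} w_k w_k^\top \\
&= W_0 W_0^\top \tilde{\Delta} W_0 W_0^\top,
\end{aligned}$$

where $W_0$ are the eigenvectors associated to the eigenvalue $0$. We have

$$D = \sum_{j,k,l} w_j w_j^\top \tilde{\Delta} w_k w_k^\top \tilde{\Delta} w_l w_l^\top \, \frac{1}{2 i \pi} \oint_{C_0} \frac{\lambda \, d\lambda}{(\lambda - s_j)(\lambda - s_k)(\lambda - s_l)}.$$

The integral is not equal to zero if and only if exactly one eigenvalue, say $s_i$, is outside
$\operatorname{int}(C_0)$. The integral is then equal to $-1/s_i$. Thus

$$\begin{aligned}
D = {} & -W_0 W_0^\top \tilde{\Delta} W_0 W_0^\top \tilde{\Delta} W S^{-1} W^\top - W S^{-1} W^\top \tilde{\Delta} W_0 W_0^\top \tilde{\Delta} W_0 W_0^\top \\
& - W_0 W_0^\top \tilde{\Delta} W S^{-1} W^\top \tilde{\Delta} W_0 W_0^\top,
\end{aligned}$$

where $S = \operatorname{Diag}(-s, s)$. Finally, putting everything together, we get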

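Proposition 6. Let $M = U \operatorname{Diag}(s) V^\top \in \mathbb{R}^{n \times p}$ be the singular
value decomposition of $M$, with $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{p \times r}$. Let
$\Delta \in \mathbb{R}^{n \times p}$. We have

$$\begin{aligned}
\|M + \Delta\|_* = {} & \|M\|_* + \|Q\|_* + \operatorname{tr}(V U^\top \Delta) \\
& + \sum_{s_j > 0} \sum_{s_k > 0} \frac{\left(u_j^\top \Delta v_k - u_k^\top \Delta v_j\right)^2}{4 (s_j + s_k)} + \sum_{s_j = 0} \sum_{s_k > 0} \frac{\left(u_k^\top \Delta v_j\right)^2 + \left(u_j^\top \Delta v_k\right)^2}{2 s_k} + o(\|\Delta\|^2),
\end{aligned}$$

where

$$\begin{aligned}
Q = {} & U_0^\top \Delta V_0 - U_0^\top \Delta V_0 V_0^\top \Delta^\top U \operatorname{Diag}(s)^{-1} \\
& - \operatorname{Diag}(s)^{-1} V^\top \Delta^\top U_0 U_0^\top \Delta V_0 - U_0^\top \Delta V \operatorname{Diag}(s)^{-1} U^\top \Delta V_0.
\end{aligned}$$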

B Proof of proposition 1



In this section, we prove that if the loss function is strongly convex with respect to its second 
argument, then the solution of the penalized empirical risk minimization is unique. 

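Let $\hat{w} \in \operatorname*{argmin}_w \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \|X \operatorname{Diag}(w)\|_*$.
If $\hat{w}$ is in the nullspace of $X$, then $\hat{w} = 0$ and the minimum is unique. From now on, we
suppose that the minima are not in the nullspace of $X$.

Let $u, v \in \operatorname*{argmin}_w \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \|X \operatorname{Diag}(w)\|_*$
and $\delta = v - u$. By convexity of the objective function, all the $w = u + t \delta$, for
$t \in \, ]0, 1[$, are also optimal solutions, and so we can choose an optimal solution $w$ such that
$w_i \neq 0$ for all $i$ in the support of $\delta$. Because the loss function is strongly convex
outside the nullspace of $X$, $\delta$ is in the nullspace of $X$.

Let $X \operatorname{Diag}(w) = U \operatorname{Diag}(s) V^\top$ be the SVD of
$X \operatorname{Diag}(w)$. We have the following development around $w$:

$$\begin{aligned}
\|X \operatorname{Diag}(w + t\delta)\|_* = {} & \|X \operatorname{Diag}(w)\|_* + \operatorname{tr}\left(\operatorname{Diag}(t\delta) X^\top U V^\top\right) \\
& + \sum_{s_i > 0} \sum_{s_j > 0} \frac{\operatorname{tr}\left(\operatorname{Diag}(t\delta) X^\top (u_i v_j^\top - u_j v_i^\top)\right)^2}{4 (s_i + s_j)}
+ \sum_{s_i > 0} \sum_{s_j = 0} \frac{\operatorname{tr}\left(\operatorname{Diag}(t\delta) X^\top u_i v_j^\top\right)^2}{2 s_i} + o(t^2).
\end{aligned}$$

We denote by $S$ the support of $w$. Using the fact that the support of $\delta$ is included in $S$,
we have $X \operatorname{Diag}(t\delta) = X \operatorname{Diag}(w) \operatorname{Diag}(t\gamma)$, where
$\gamma_i = \delta_i / w_i$ for $i \in S$ and $0$ otherwise. Then:

$$\begin{aligned}
\|X \operatorname{Diag}(w + t\delta)\|_* = {} & \|X \operatorname{Diag}(w)\|_* + t \, \gamma^\top \operatorname{diag}\left(V \operatorname{Diag}(s) V^\top\right) \\
& + \sum_{s_i > 0} \sum_{s_j > 0} \frac{t^2 \operatorname{tr}\left((s_i - s_j) \operatorname{Diag}(\gamma) v_i v_j^\top\right)^2}{4 (s_i + s_j)}
+ \sum_{s_i > 0} \sum_{s_j = 0} \frac{t^2 \operatorname{tr}\left(s_i \operatorname{Diag}(\gamma) v_i v_j^\top\right)^2}{2 s_i} + o(t^2).
\end{aligned}$$

For small $t$, $w + t\delta$ is also a minimum, and therefore, we have:

$$\forall s_i > 0, s_j > 0, \quad (s_i - s_j) \operatorname{tr}\left(\operatorname{Diag}(\gamma) v_i v_j^\top\right) = 0, \qquad (4)$$

$$\forall s_i > 0, s_j = 0, \quad \operatorname{tr}\left(\operatorname{Diag}(\gamma) v_i v_j^\top\right) = 0. \qquad (5)$$

This could be summarized as

$$\forall s_i \neq s_j, \quad v_i^\top \left(\operatorname{Diag}(\gamma) v_j\right) = 0. \qquad (6)$$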

This means that the eigenspaces of $\operatorname{Diag}(w) X^\top X \operatorname{Diag}(w)$ are stable
by the matrix $\operatorname{Diag}(\gamma)$. Therefore,
$\operatorname{Diag}(w) X^\top X \operatorname{Diag}(w)$ and $\operatorname{Diag}(\gamma)$ are
simultaneously diagonalizable and so, they commute. Therefore:

$$\forall i, j \in S, \quad \sigma_{ij} \gamma_i = \sigma_{ij} \gamma_j, \qquad (7)$$

where $\sigma_{ij} = [X^\top X]_{ij}$. We define a partition $(S_k)$ of $S$, such that $i$ and $j$ are
in the same set $S_k$ if there exists a path $i = a_1, \dots, a_m = j$ such that
$\sigma_{a_n, a_{n+1}} \neq 0$ for all $n \in \{1, \dots, m - 1\}$. Then, using equation (7), $\gamma$
is constant on each $S_k$. $\delta$ being in the nullspace of $X$, we have:

$$\begin{aligned}
0 &= \delta^\top X^\top X \delta && (8) \\
&= \sum_{S_k} \sum_{S_l} \delta_{S_k}^\top X^\top X \delta_{S_l} && (9) \\
&= \sum_{S_k} \delta_{S_k}^\top X^\top X \delta_{S_k} && (10) \\
&= \sum_{S_k} \|X \delta_{S_k}\|_2^2. && (11)
\end{aligned}$$

So for all $S_k$, $X \delta_{S_k} = 0$. Since a predictor $X^{(i)}$ is orthogonal to all the predictors
belonging to other groups defined by the partition $(S_k)$, we can decompose the norm $\Omega$:

$$\|X \operatorname{Diag}(w)\|_* = \sum_{S_k} \|X \operatorname{Diag}(w_{S_k})\|_*. \qquad (12)$$

We recall that $\gamma$ is constant on each $S_k$ and so $\delta_{S_k}$ is colinear to $w_{S_k}$, by
definition of $\gamma$. If $\delta_{S_k}$ is not equal to zero, this means that $w_{S_k}$, which is
not equal to zero, is in the nullspace of $X$. Replacing $w_{S_k}$ by $0$ will not change the value of
the data fitting term but it will strictly decrease the value of the norm $\Omega$. This is a
contradiction with the optimality of $w$. Thus all the $\delta_{S_k}$ are equal to zero and the
minimum is unique.



C Proof of proposition 3 

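For the first inequality, we have

$$\|w\|_2 = \|P \operatorname{Diag}(w)\|_F \leq \|P \operatorname{Diag}(w)\|_*.$$

For the second inequality, we have

$$\begin{aligned}
\|P \operatorname{Diag}(w)\|_* &= \max_{\|M\|_{op} \leq 1} \operatorname{tr}\left(M^\top P \operatorname{Diag}(w)\right) \\
&= \max_{\|M\|_{op} \leq 1} \operatorname{diag}\left(M^\top P\right)^\top w \\
&\leq \max_{\|M\|_{op} \leq 1} \sum_{i=1}^p \left| M^{(i)\top} P^{(i)} \right| |w_i| \\
&\leq \|w\|_1.
\end{aligned}$$

The first equality is the fact that the dual norm of the trace norm is the operator norm, and the
second inequality uses the fact that all matrices of operator norm smaller than one have columns
of $\ell_2$-norm smaller than one.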